
Keyword Search Result

[Keyword] Markov model (95 hits)

Hits 41-60 of 95

  • A Fully Consistent Hidden Semi-Markov Model-Based Speech Recognition System

    Keiichiro OURA  Heiga ZEN  Yoshihiko NANKAKU  Akinobu LEE  Keiichi TOKUDA  

     
    PAPER-Speech and Hearing

    Vol: E91-D  No: 11  Page(s): 2693-2700

    In a hidden Markov model (HMM), state duration probabilities decrease exponentially with time, which fails to adequately represent the temporal structure of speech. One of the solutions to this problem is integrating state duration probability distributions explicitly into the HMM. This form is known as a hidden semi-Markov model (HSMM). However, though a number of attempts to use HSMMs in speech recognition systems have been proposed, they are not consistent because various approximations were used in both training and decoding. By avoiding these approximations using a generalized forward-backward algorithm, a context-dependent duration modeling technique and weighted finite-state transducers (WFSTs), we construct a fully consistent HSMM-based speech recognition system. In a speaker-dependent continuous speech recognition experiment, our system achieved about 9.1% relative error reduction over the corresponding HMM-based system.
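
The exponential decay mentioned in this abstract follows directly from the self-transition probability of an HMM state. The sketch below is only an illustration of that point, not the authors' system; the function name `hmm_duration_pmf` and the self-loop value 0.8 are assumptions chosen for the example.

```python
def hmm_duration_pmf(self_loop_prob, max_duration):
    """Return P(d) for d = 1..max_duration for one HMM state.

    Staying exactly d frames means d - 1 self-transitions followed by one
    exit, so P(d) = a**(d - 1) * (1 - a), which is a geometric distribution.
    """
    a = self_loop_prob
    return [a ** (d - 1) * (1.0 - a) for d in range(1, max_duration + 1)]


# Illustrative self-loop probability (an assumption, not a value from the paper).
pmf = hmm_duration_pmf(0.8, 5)

# Each probability is the previous one multiplied by the constant factor a,
# so a duration of one frame is always the most likely.
for d, p in enumerate(pmf, start=1):
    print(d, round(p, 4))
```

An HSMM replaces this implied geometric distribution with an explicitly parameterized duration distribution per state.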

  • HMM-Based Mask Estimation for a Speech Recognition Front-End Using Computational Auditory Scene Analysis

    Ji Hun PARK  Jae Sam YOON  Hong Kook KIM  

     
    LETTER-Speech and Hearing

    Vol: E91-D  No: 9  Page(s): 2360-2364

    In this paper, we propose a new mask estimation method for the computational auditory scene analysis (CASA) of speech using two microphones. The proposed method is based on a hidden Markov model (HMM) in order to incorporate the observation that the mask information should be correlated over contiguous analysis frames. In other words, an HMM is used to estimate the mask information represented as the interaural time difference (ITD) and the interaural level difference (ILD) of the two channel signals, and the estimated mask information is then employed to separate the desired speech from noisy speech. To show the effectiveness of the proposed mask estimation, we compare the proposed method with a Gaussian kernel-based estimation method in terms of speech recognition performance. As a result, the proposed HMM-based mask estimation method provided an average word error rate reduction of 61.4% compared with the Gaussian kernel-based mask estimation method.
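
The idea that mask decisions should be correlated over contiguous frames can be illustrated with a generic Viterbi decoder over a two-state chain. This is a sketch under stated assumptions, not the paper's estimator: the ITD/ILD observation model is not reproduced, and the per-frame speech probabilities and the "sticky" transition probability 0.9 are made-up values.

```python
import math


def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state path.

    log_init[s]     : log prior of state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_obs[t][s]   : log likelihood of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_prev = max(range(n_states),
                            key=lambda r: score[r] + log_trans[r][s])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s]
                             + log_obs[t][s])
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# State 0 = noise-dominant, state 1 = speech-dominant (hypothetical labels).
speech_prob = [0.9, 0.9, 0.4, 0.9, 0.9]  # made-up per-frame evidence
log_obs = [[math.log(1 - p), math.log(p)] for p in speech_prob]
log_init = [math.log(0.5), math.log(0.5)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]

framewise = [int(p > 0.5) for p in speech_prob]   # independent decisions
smoothed = viterbi(log_init, log_trans, log_obs)  # temporally correlated
print(framewise, smoothed)
```

With these values, the frame-by-frame decision flips at the third frame, whereas the decoded path keeps the mask in the speech-dominant state throughout, because a single weak frame does not outweigh two unlikely transitions.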

  • Random Texture Defect Detection Using 1-D Hidden Markov Models Based on Local Binary Patterns

    Hadi HADIZADEH  Shahriar BARADARAN SHOKOUHI  

     
    PAPER

    Vol: E91-D  No: 7  Page(s): 1937-1945

    In this paper, a novel method for random texture defect detection using a collection of 1-D HMMs is presented. The sound textural content of a sample of training texture images is first encoded by a compressed local binary pattern (LBP) histogram, and then the local patterns of the input training textures are learned, in a multiscale framework, through a series of HMMs according to the LBP codes that belong to each bin of this compressed LBP histogram. The hidden states of these HMMs at different scales are used as a texture descriptor that can model the normal behavior of the local texture units inside the training images. The optimal number of these HMMs (models) is determined in an unsupervised manner as a model selection problem. Finally, at the testing stage, the local patterns of the input test image are first predicted by the trained HMMs, and a prediction error is calculated for each pixel position in order to obtain a defect map at each scale. The detection results are then merged by an inter-scale post-fusion method for novelty detection. The proposed method is tested with a database of grayscale ceramic tile images.

  • View Invariant Human Action Recognition Based on Factorization and HMMs

    Xi LI  Kazuhiro FUKUI  

     
    PAPER

    Vol: E91-D  No: 7  Page(s): 1848-1854

    This paper addresses the problem of view-invariant action recognition using 2D trajectories of landmark points on the human body. It is a challenging task because, for a specific action category, the 2D observations of different instances might be extremely different due to varying viewpoints and changes in speed. By assuming that the execution of an action can be approximated by a dynamic linear combination of a set of basis shapes, a novel view-invariant human action recognition method is proposed based on non-rigid matrix factorization and hidden Markov models (HMMs). We show that the low-dimensional weight coefficients of the basis shapes obtained by non-rigid factorization of the measurement matrix contain the key information for action recognition regardless of viewpoint changes. Based on the extracted discriminative features, HMMs are used for temporal dynamic modeling and robust action classification. The proposed method is tested using real-life sequences, and promising performance is achieved.

  • Performance Analysis of IEEE 802.11 DCF and IEEE 802.11e EDCA in Non-saturation Condition

    Tae Ok KIM  Kyung Jae KIM  Bong Dae CHOI  

     
    PAPER-Terrestrial Radio Communications

    Vol: E91-B  No: 4  Page(s): 1122-1131

    We analyze the MAC performance of the IEEE 802.11 DCF and IEEE 802.11e EDCA under the non-saturation condition, in which a device sometimes has no packets to transmit. We assume that a new flow is not generated while the previous flow is in service and that the number of packets in a flow is geometrically distributed. In this paper, we take into account a feature of the non-saturation condition in the standards: the possibility that the first packet arriving at an idle station is transmitted without a preceding backoff procedure. Our approach is to model the stochastic behavior of one station as a discrete-time Markov chain. We obtain four performance measures: normalized channel throughput, average packet head-of-line (HoL) delay, expected time to complete transmission of a flow, and packet loss probability. Our results can be used for admission control to find the optimal number of stations under constraints on these measures.
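
Performance measures of this kind are typically derived from the stationary distribution of the chain. As a rough sketch only, the code below computes the stationary distribution of a toy three-state chain by power iteration; the states, the transition probabilities, and the function name `stationary_distribution` are all hypothetical and far simpler than the chain analyzed in the paper.

```python
def stationary_distribution(P, iterations=1000):
    """Approximate pi with pi = pi P by repeated multiplication (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Toy station model (all numbers are made up for illustration):
#   state 0 = idle, state 1 = backoff, state 2 = transmit.
# From idle, a newly arrived packet is sent without backoff (0 -> 2).
P = [
    [0.7, 0.0, 0.3],  # idle: stay idle, or transmit directly
    [0.0, 0.6, 0.4],  # backoff: keep counting down, or transmit
    [0.5, 0.5, 0.0],  # transmit: flow ends (idle) or more packets (backoff)
]

pi = stationary_distribution(P)
print("long-run fraction of slots spent transmitting:", round(pi[2], 4))
```

The long-run fraction of time in the transmit state is the kind of quantity from which a throughput measure could be built, although the paper's actual derivation is not shown in the abstract.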

  • Joint Blind Super-Resolution and Shadow Removing

    Jianping QIAO  Ju LIU  Yen-Wei CHEN  

     
    PAPER-Image Processing and Video Processing

    Vol: E90-D  No: 12  Page(s): 2060-2069

    Most learning-based super-resolution methods neglect the illumination problem. In this paper, we propose a novel method that combines blind single-frame super-resolution and shadow removal into a single operation. First, from the pattern recognition viewpoint, blur identification is treated as a classification problem. We describe three methods, based respectively on vector quantization (VQ), the hidden Markov model (HMM), and support vector machines (SVMs), to identify the blur parameter of the acquisition system from the compressed or uncompressed low-resolution image. Second, after blur identification, a super-resolution image is reconstructed by a learning-based method. In this method, a logarithmic-wavelet transform is defined for illumination-free feature extraction. An initial estimate is then obtained based on the assumption that small patches in low-resolution space and patches in high-resolution space share a similar local manifold structure. The unknown high-resolution image is reconstructed by projecting the intermediate result onto general reconstruction constraints. The proposed method simultaneously achieves blind single-frame super-resolution and image enhancement, especially shadow removal. Experimental results demonstrate the effectiveness and robustness of our method.

  • A Style Control Technique for HMM-Based Expressive Speech Synthesis

    Takashi NOSE  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

    Vol: E90-D  No: 9  Page(s): 1406-1413

    This paper describes a technique for controlling the degree of expressivity of a desired emotional expression and/or speaking style of synthesized speech in an HMM-based speech synthesis framework. With this technique, multiple emotional expressions and speaking styles of speech are modeled in a single model by using a multiple-regression hidden semi-Markov model (MRHSMM). A set of control parameters, called the style vector, is defined, and each speech synthesis unit is modeled by using the MRHSMM, in which mean parameters of the state output and duration distributions are expressed by multiple-regression of the style vector. In the synthesis stage, the mean parameters of the synthesis units are modified by transforming an arbitrarily given style vector that corresponds to a point in a low-dimensional space, called style space, each of whose coordinates represents a certain specific speaking style or emotion of speech. The results of subjective evaluation tests show that style and its intensity can be controlled by changing the style vector.
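
The statement that mean parameters are expressed by multiple regression of the style vector can be made concrete with a small sketch. It assumes the common form in which the mean is a regression matrix applied to the style vector augmented with a constant 1; the matrix `H`, its values, and the function name `regression_mean` are illustrative assumptions, not parameters from the paper.

```python
def regression_mean(H, style_vector):
    """Mean vector as a multiple regression of the style vector.

    Assumed form: mu = H @ [1, v_1, ..., v_L], where the leading 1 supplies
    the bias term and H has one row per dimension of the mean.
    """
    xi = [1.0] + list(style_vector)
    return [sum(h * x for h, x in zip(row, xi)) for row in H]


# Hypothetical 2-dimensional mean controlled by a 2-dimensional style vector.
H = [
    [1.0, 0.5, -0.2],
    [0.0, 0.3, 0.4],
]

neutral = regression_mean(H, [0.0, 0.0])     # origin of the style space
style_a = regression_mean(H, [1.0, 0.0])     # one unit along the first style axis
emphasized = regression_mean(H, [2.0, 0.0])  # same style, doubled intensity
print(neutral, style_a, emphasized)
```

Moving the style vector further along one axis shifts the mean linearly, which is how changing the vector can control both the style and its intensity.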

  • Dynamic Bayesian Network Inversion for Robust Speech Recognition

    Lei XIE  Hongwu YANG  

     
    LETTER-Speech and Hearing

    Vol: E90-D  No: 7  Page(s): 1117-1120

    This paper presents an inversion algorithm for dynamic Bayesian networks (DBNs) aimed at robust speech recognition, namely DBNI, which is a generalization of hidden Markov model inversion (HMMI). As a dual procedure of expectation maximization (EM)-based model reestimation, DBNI finds the 'uncontaminated' speech by moving the input noisy speech toward the Gaussian means in the maximum likelihood (ML) sense, given DBN models trained on clean speech. This algorithm can provide both the expressive advantage of DBNs and the noise-removal feature of model inversion. Experiments on the Aurora 2.0 database show that the hidden feature model (a typical DBN for speech recognition) with the DBNI algorithm achieves superior performance in terms of word error rate reduction.

  • A Hidden Semi-Markov Model-Based Speech Synthesis System

    Heiga ZEN  Keiichi TOKUDA  Takashi MASUKO  Takao KOBAYASHI  Tadashi KITAMURA  

     
    PAPER-Speech and Hearing

    Vol: E90-D  No: 5  Page(s): 825-834

    A statistical speech synthesis system based on the hidden Markov model (HMM) was recently proposed. In this system, spectrum, excitation, and duration of speech are modeled simultaneously by context-dependent HMMs, and speech parameter vector sequences are generated from the HMMs themselves. This system defines a speech synthesis problem in a generative model framework and solves it based on the maximum likelihood (ML) criterion. However, there is an inconsistency: although state duration probability density functions (PDFs) are explicitly used in the synthesis part of the system, they have not been incorporated into its training part. This inconsistency can make the synthesized speech sound less natural. In this paper, we propose a statistical speech synthesis system based on a hidden semi-Markov model (HSMM), which can be viewed as an HMM with explicit state duration PDFs. The use of HSMMs can solve the above inconsistency because we can incorporate the state duration PDFs explicitly into both the synthesis and the training parts of the system. Subjective listening test results show that use of HSMMs improves the reported naturalness of synthesized speech.

  • State Duration Modeling for HMM-Based Speech Synthesis

    Heiga ZEN  Takashi MASUKO  Keiichi TOKUDA  Takayoshi YOSHIMURA  Takao KOBAYASHI  Tadashi KITAMURA  

     
    LETTER-Speech and Hearing

    Vol: E90-D  No: 3  Page(s): 692-693

    This paper describes the explicit modeling of a state duration's probability density function in HMM-based speech synthesis. We redefine, in a statistically correct manner, the probability of staying in a state for a time interval used to obtain the state duration PDF and demonstrate improvements in the duration of synthesized speech.

  • A Systolic FPGA Architecture of Two-Level Dynamic Programming for Connected Speech Recognition

    Yong KIM  Hong JEONG  

     
    PAPER-Speech and Hearing

    Vol: E90-D  No: 2  Page(s): 562-568

    In this paper, we present an efficient architecture for connected word recognition that can be implemented on a field-programmable gate array (FPGA). The architecture is built on a newly derived two-level dynamic programming (TLDP) formulation that uses only bit addition and shift operations. The advantages of this architecture are its spatial efficiency, which accommodates more words in limited space, and the absence of multiplications, which increases computational speed by reducing propagation delays. The architecture is highly regular, consisting of identical and simple processing elements with only nearest-neighbor communication, and external communication occurs only at the end processing elements. To verify the proposed architecture, we have also designed and implemented it, prototyping with Xilinx FPGAs running at 33 MHz.

  • Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training

    Junichi YAMAGISHI  Takao KOBAYASHI  

     
    PAPER-Speech and Hearing

    Vol: E90-D  No: 2  Page(s): 533-543

    In speaker adaptation for speech synthesis, it is desirable to convert both voice characteristics and prosodic features such as F0 and phone duration. For simultaneous adaptation of spectrum, F0 and phone duration within the HMM framework, we need to transform not only the state output distributions corresponding to spectrum and F0 but also the duration distributions corresponding to phone duration. However, it is not straightforward to adapt the state duration because the original HMM does not have explicit duration distributions. Therefore, we utilize the framework of the hidden semi-Markov model (HSMM), which is an HMM having explicit state duration distributions, and we apply an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions. Furthermore, we propose an HSMM-based adaptive training algorithm to simultaneously normalize the state output and state duration distributions of the average voice model. We incorporate these techniques into our HSMM-based speech synthesis system, and show their effectiveness from the results of subjective and objective evaluation tests.

  • A Hybrid HMM/Kalman Filter for Tracking Hip Angle in Gait Cycle

    Liang DONG  Jiankang WU  Xiaoming BAO  

     
    LETTER-Biological Engineering

    Vol: E89-D  No: 7  Page(s): 2319-2323

    Movement of the thighs is an important factor in studying the gait cycle. In this paper, a hybrid hidden Markov model (HMM)/Kalman filter (KF) scheme is proposed to track the hip angle during gait cycles. Within this framework, the HMM and KF work in parallel to estimate the hip angle and detect major gait events. This approach has been applied to study the gait features of different subjects and compared with a video-based approach. Experimental results indicate that (1) the swing angle of the hip can be detected with a simple hardware configuration using biaxial accelerometers and (2) the hip angle can be tracked for different subjects within an error range of -5° to +5°.
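
For readers unfamiliar with the Kalman filter half of such a scheme, the sketch below tracks a single angle with a scalar filter. It is a minimal illustration under stated assumptions, not the paper's hybrid method: the HMM part is omitted, the angle is modeled as a random walk, and the noise variances, initial values, and function name `kalman_track` are made up.

```python
def kalman_track(measurements, process_var=1.0, meas_var=4.0,
                 init_angle=0.0, init_var=100.0):
    """Scalar Kalman filter with a random-walk state model.

    Returns the filtered angle estimate after each measurement.
    """
    angle, var = init_angle, init_var
    estimates = []
    for z in measurements:
        # Predict: the angle is assumed unchanged, but uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = var / (var + meas_var)
        angle += gain * (z - angle)
        var *= (1.0 - gain)
        estimates.append(angle)
    return estimates


# Hypothetical noise-free readings of a constant 10-degree angle.
estimates = kalman_track([10.0] * 10)
print([round(e, 3) for e in estimates])
```

A real tracker would use a motion model suited to the periodic swing of the hip and accelerometer-derived measurements; the random-walk model here is chosen only to keep the predict and update steps visible.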

  • HHMM Based Recognition of Human Activity

    Daiki KAWANAKA  Takayuki OKATANI  Koichiro DEGUCHI  

     
    PAPER-Face, Gesture, and Action Recognition

    Vol: E89-D  No: 7  Page(s): 2180-2185

    In this paper, we present a method for recognizing human activity as a series of actions from an image sequence. The difficulty of the problem is a chicken-and-egg dilemma: each action needs to be extracted in advance for its recognition, but precise extraction is only possible after the action is correctly identified. To solve this dilemma, we use as many models as there are actions of interest and test each model against a given sequence to find a matched model for each action occurring in the sequence. For each action, a model is designed so as to represent any activity containing the action. The hierarchical hidden Markov model (HHMM) is employed to represent the models; each model is composed of a submodel of the target action and submodels that can represent any action, connected appropriately. Several experimental results are shown.

  • A Style Adaptation Technique for Speech Synthesis Using HSMM and Suprasegmental Features

    Makoto TACHIBANA  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER-Speech Synthesis

    Vol: E89-D  No: 3  Page(s): 1092-1099

    This paper proposes a technique for synthesizing speech with a desired speaking style and/or emotional expression, based on model adaptation in an HMM-based speech synthesis framework. Speaking styles and emotional expressions are characterized by many segmental and suprasegmental features in both spectral and prosodic features. Therefore, it is essential to take account of these features in the model adaptation. The proposed technique, called style adaptation, deals with this issue. First, the maximum likelihood linear regression (MLLR) algorithm, based on the framework of the hidden semi-Markov model (HSMM), is presented to provide a mathematically rigorous and robust adaptation of state duration and to adapt both the spectral and prosodic features. Then, a novel tying method for the regression matrices of the MLLR algorithm is presented to allow the incorporation of both the segmental and suprasegmental speech features into the style adaptation. The proposed tying method uses regression class trees with contextual information. From the results of several subjective tests, we show that these techniques can perform style adaptation while maintaining the naturalness of the synthetic speech.

  • Training Augmented Models Using SVMs

    Mark J.F. GALES  Martin I. LAYTON  

     
    INVITED PAPER

    Vol: E89-D  No: 3  Page(s): 892-899

    There has been significant interest in developing new forms of acoustic model, in particular models that allow dependencies to be represented beyond those contained within a standard hidden Markov model (HMM). This paper discusses one such class of models, augmented statistical models. Here, a local exponential approximation is made about some point on a base model. This allows dependencies within the data to be modelled beyond those represented in the base distribution. Augmented models based on Gaussian mixture models (GMMs) and HMMs are briefly described. These augmented models are then related to generative kernels, one approach used for allowing support vector machines (SVMs) to be applied to variable-length data. The training of augmented statistical models within an SVM, generative kernel, framework is then discussed. This may be viewed as using maximum margin training to estimate statistical models. Augmented Gaussian mixture models are then evaluated using rescoring on a large vocabulary speech recognition task.

  • What HMMs Can Do

    Jeff A. BILMES  

     
    INVITED PAPER

    Vol: E89-D  No: 3  Page(s): 869-891

    Since their inception almost fifty years ago, hidden Markov models (HMMs) have become the predominant methodology for automatic speech recognition (ASR) systems; today, most state-of-the-art speech systems are HMM-based. There have been a number of ways to explain HMMs and to list their capabilities, each of these ways having both advantages and disadvantages. In an effort to better understand what HMMs can do, this tutorial article analyzes HMMs by exploring a definition of HMMs in terms of random variables and conditional independence assumptions. We prefer this definition as it allows us to reason more thoroughly about the capabilities of HMMs. In particular, it is possible to deduce that there are, in theory at least, no limitations to the class of probability distributions representable by HMMs. This paper concludes that, in search of a model to supersede the HMM (say for ASR), rather than trying to correct for HMM limitations in the general case, new models should be found based on their potential for better parsimony, computational requirements, and noise insensitivity.
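
A definition of this kind can be written as a factorization of the joint distribution. The formula below is the standard textbook form, given here as a sketch; the symbols $x_t$ (observation at time $t$) and $q_t$ (hidden state at time $t$) are notation chosen for this example and may differ from the article's.

```latex
\[
p(x_{1:T}, q_{1:T}) \;=\; p(q_1)\, p(x_1 \mid q_1) \prod_{t=2}^{T} p(q_t \mid q_{t-1})\, p(x_t \mid q_t),
\qquad
p(x_{1:T}) \;=\; \sum_{q_{1:T}} p(x_{1:T}, q_{1:T}).
\]
```

The factorization encodes two conditional independence assumptions: given $q_{t-1}$, the state $q_t$ is independent of all earlier states and observations, and given $q_t$, the observation $x_t$ is independent of every other variable.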

  • Genetic Algorithm Based Optimization of Partly-Hidden Markov Model Structure Using Discriminative Criterion

    Tetsuji OGAWA  Tetsunori KOBAYASHI  

     
    PAPER-Speech Recognition

    Vol: E89-D  No: 3  Page(s): 939-945

    Discriminative modeling is applied to optimize the structure of a Partly-Hidden Markov Model (PHMM). The PHMM was proposed in our previous work to deal with the complicated temporal changes of acoustic features. It can represent observation-dependent behaviors in both observations and state transitions. In the formulation of the previous PHMM, we used a common structure for all models. However, the optimal structure, which gives the best performance, is expected to differ from category to category. In this paper, we design a new structure optimization method in which the dependence between the states and the observations of the PHMM is optimally defined for each model using the weighted likelihood-ratio maximization (WLRM) criterion. The WLRM criterion gives high discriminability between the correct category and the incorrect categories, and therefore yields model structures with good discriminative performance. Among all possible structure combinations, we define the combination that satisfies the WLRM criterion as the optimal structure. A genetic algorithm is applied to adequately approximate a full search. Results on continuous lecture talk speech recognition show the effectiveness of the proposed structure optimization: it reduced word errors compared with the HMM and with the PHMM using a common structure for all models.

  • Human Walking Motion Synthesis with Desired Pace and Stride Length Based on HSMM

    Naotake NIWASE  Junichi YAMAGISHI  Takao KOBAYASHI  

     
    PAPER

    Vol: E88-D  No: 11  Page(s): 2492-2499

    This paper presents a new technique for automatically synthesizing human walking motion. In the technique, a set of fundamental motion units called motion primitives is defined, and each primitive is modeled statistically from motion capture data using a hidden semi-Markov model (HSMM), which is a hidden Markov model (HMM) with explicit state duration probability distributions. The mean parameter of the probability distribution function of the HSMM is assumed to be given by a function of factors that control the walking pace and stride length, and a training algorithm, called factor adaptive training, is derived based on the EM algorithm. A parameter generation algorithm from motion primitive HSMMs with given control factors is also described. Experimental results for generating walking motion are presented for varying walking pace and stride length. The results show that the proposed technique can generate smooth and realistic motion that is not included in the motion capture data, without the need for smoothing or interpolation.

  • Speech Synthesis with Various Emotional Expressions and Speaking Styles by Style Interpolation and Morphing

    Makoto TACHIBANA  Junichi YAMAGISHI  Takashi MASUKO  Takao KOBAYASHI  

     
    PAPER

    Vol: E88-D  No: 11  Page(s): 2484-2491

    This paper describes an approach to generating speech with emotional expressivity and speaking style variability. The approach is based on a speaking style and emotional expression modeling technique for HMM-based speech synthesis. We first model several representative styles, each of which is a speaking style and/or an emotional expression, in an HMM-based speech synthesis framework. Then, to generate synthetic speech with an intermediate style from representative ones, we synthesize speech from a model obtained by interpolating representative style models using a model interpolation technique. We assess the style interpolation technique with subjective evaluation tests using four representative styles, i.e., neutral, joyful, sad, and rough in read speech and synthesized speech from models obtained by interpolating models for all combinations of two styles. The results show that speech synthesized from the interpolated model has a style in between the two representative ones. Moreover, we can control the degree of expressivity for speaking styles or emotions in synthesized speech by changing the interpolation ratio in interpolation between neutral and other representative styles. We also show that we can achieve style morphing in speech synthesis, namely, changing style smoothly from one representative style to another by gradually changing the interpolation ratio.
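
The role of the interpolation ratio can be illustrated with a single Gaussian output distribution per style. The abstract does not state the exact interpolation rule, so the sketch below assumes one simple possibility, a linear blend of means and variances; the function name `interpolate_gaussian` and all numeric values are hypothetical.

```python
def interpolate_gaussian(mean_a, var_a, mean_b, var_b, ratio):
    """Blend two diagonal Gaussians; ratio = 0 gives style A, ratio = 1 gives style B.

    Assumption: means and variances are interpolated linearly. Other rules
    are possible and may be what the paper actually uses.
    """
    w_a, w_b = 1.0 - ratio, ratio
    mean = [w_a * a + w_b * b for a, b in zip(mean_a, mean_b)]
    var = [w_a * a + w_b * b for a, b in zip(var_a, var_b)]
    return mean, var


# Made-up parameters standing in for a "neutral" and a "joyful" style model.
neutral_mean, neutral_var = [0.0, 1.0], [1.0, 1.0]
joyful_mean, joyful_var = [2.0, 3.0], [2.0, 0.5]

# Sweeping the ratio from 0 to 1 morphs gradually from one style to the other.
for ratio in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(ratio, interpolate_gaussian(neutral_mean, neutral_var,
                                      joyful_mean, joyful_var, ratio))
```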
